A short description of the post.
Since 20 years ago, GASTech has been operating a natural gas production site in the island country of Kronos. The business has been profitable, and the company has also developed close relationships with the Kronos Government. In January 2014, following GASTech’s initial public offering listing, several GASTech employees has gone missing. An organization known as Protectors of Kronos (POK) is suspected in these cases of missing persons, as GASTech’s business moves had not been too environmentally friendly. A thorough investigation is to be carried out by the law enforcers of Kronos and Tethys to break this case.
Information and data pertaining to the whereabouts of company cars, purchases made by employees in local stores have been provided to the law enforcers. We shall use visual analytics to sense-make this data to facilitate the investigation.
This would be done as a sub-component which would eventually feed into an interactive Shiny app for the use of the law enforcers, together with other sub-components covered by my project group mates. The objective of this assignment is to explore what the insights are and how they can be brought out from the depths of this dataset to aid in the investigation.
Crime analysis is a law enforcement function which involves systematic analysis for identifying and analyzing patterns and trends in crime and disorder [Wikipedia]. Too little data would inevitably limit the efficiency of the investigation, but overwhelming volume of information could pose a huge challenge as well. Coupled with the need for rapid analysis, too much information to absorb, categorize, remember and draw meaning from could compromise the overall investigation [Data Mining and Predictive Analysis, Colleen McCue, 2007]. Visual analytics techniques could be employed to gain useful insight from massive raw data.
For efficiency in data processing, information must first be consistent as subtle differences can greatly increase variability and reduce the reliability and value of a dataset [Colleen McCue, 2007]. Next, there cannot be information overload within a diagram. Good practices such as appropriate brushing and linking, selecting and marking, aggregation, elimination, virtual navigation techniques such as zooming, focus + context, and details-on-demand techniques have been studied and used to overcome an over-cluttered screen.[Visual Analytics for Crime Analysis and Decision Support, Ku et. al, 2016]. Uninteresting and expected patterns can also be unmarked to improve efficiency and reduce false positives. [Arxiv, Visual of Anomalous User Behavior ]
According to [Arxiv, Visual of Anomalous User Behavior], detection of anomalous user behaviors can be a challenging task as the boundary between abnormal and normal data may not be clearly defined, and approaches like machine learning lack contextual information to support decision-making. Visualization techniques like sequence visualization, graph visualization, text visualization, geographic visualization, chart visualization can be combined with interaction methods like tracking and monitoring, pattern discovery, exploration and navigation to analyze anomalous user behaviors.
Analysis of anomalous travel behaviors can take the following approaches:
Analysis of anomalous transactions can take the following approaches:
According to Robert Krueger’s dissertation on Visual Analytics of Human Mobility Behavior, movement data is more complex to handle than simple point-based data as it contains complex hierarchical structures of overlapping trajectories with diverse shapes and directions.
Movements can be spatially aggregated. A full spatial and temporal aggregation of the trajectories can result in a static graph G = (V,E) consisting of nodes V and edges E. Each edge e = (u,v) can encode directions and contain a weight that holds the travel volumes between the nodes. Analysis is flexible with this graph network and techniques such as clustering, segmentation, aggregation can be performed.

This is the world famous cholera map produced by Doctor John Snow in 1854. Each bar plotted onto the map represents a death case. It was then immediately apparent where the deaths clustered, enabling investigation to be directed and focused. It eventually led them to the culprit water pump in Broad street which was polluted by sewage water tainted by a disposed baby nappy with cholera. Death statistics on their own might not have led to this discovery this soon had it not been geo-localized. This spatial autocorrelation is powerful.
Visualizations can include many types such as thematic maps, scatter plots, parallel coordinate plots, timelines and a wide range of other techniques. Interactivity to allow quick switching between these views can facilitate more insights. These systems are described generally as coordinated-view visualizations. An example is shown as below. Global population trends by country are compared using a parallel coordinate plot, choropleth map and treemap. (Image courtesy of the National Center for Visual Analytics at Linkoping University)

There can be perspective distortion and occlusion, but the spatio-temporal distribution can be highlighted. [R. Krueger]. Space-time cubes show change over time within geographic space. Each cube represents a slice of time, in which the topmost cube has the newest timestamp. Temporal changes in that geographic area can then be visualized. Map below shows a space-time cube web scene in ArcGIS Online (AGOL).

The following information is available.
The type of data required would vary with the questions. This section would only cover the main data set up, EDA, quality of data, any general manipulations such as correcting the format of the data.
Special data manipulation specific to the questions would be covered in their respective sections instead.
We first load the datasets, via the read_csv() function.
car_assigned <- read_csv("data/car-assignments.csv")
gpstracking <- read_csv("data/gps.csv")
loyaltycard <- read_csv("data/loyalty_data.csv")
creditcard <- read_csv("data/cc_data.csv")
emprecords <- read_csv("data/EmployeeRecords.csv")
To check for missing values, we use the naniar package. The naniar package provides tidy ways to summarize, visualize and manipulate missing data.
#For car_assigned
miss_var_summary(car_assigned)
# A tibble: 5 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 CarID 9 20.5
2 LastName 0 0
3 FirstName 0 0
4 CurrentEmploymentType 0 0
5 CurrentEmploymentTitle 0 0
#For creditcard
miss_var_summary(creditcard)
# A tibble: 4 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 timestamp 0 0
2 location 0 0
3 price 0 0
4 last4ccnum 0 0
#For gpstracking
miss_var_summary(gpstracking)
# A tibble: 4 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 Timestamp 0 0
2 id 0 0
3 lat 0 0
4 long 0 0
#For loyaltycard
miss_var_summary(loyaltycard)
# A tibble: 4 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 timestamp 0 0
2 location 0 0
3 price 0 0
4 loyaltynum 0 0
Here, we observe that the only dataset with missing values is car_assigned. From the filter below, we see that the truck drivers are not assigned cars. For now, other than to acknowledge this fact, we are indifferent to the missing values as there seems to be no other dataset which contains LastName and FirstName to be able to use the information below. Later, we would see that there is location tracking information on vehicles not within the cars list. These vehicles can be associated with any of the truck drivers below.
For now, these rows can also be removed since we will be doing a join with the other datasets and would require a uniquely valid column without missing values. They will be removed via the complete.cases() function.
filtered_car <- car_assigned %>%
filter(is.na(CarID))
car_assigned_only <- car_assigned[complete.cases(car_assigned),]
Next, we shall ensure that the Timestamps are in the right and consistent format, using the lubridate package. For the loyaltycard dataset, the timestamp is only in mdy format. We will observe that the timestamp will be converted to POSIXct (for creditcard and gpstracking dataset), and Date (for loyaltycard as there is only date data).
#For creditcard dataset
creditcard$TimeStampFormatted <-mdy_hm(creditcard$timestamp)
#Delete timestamp column
creditcard <- creditcard %>%
dplyr::select(-timestamp)
#Reorder columns
col_order <- c("TimeStampFormatted", "location","price","last4ccnum")
creditcard <- creditcard[, col_order]
#For gpstracking dataset
gpstracking$TimeStampFormatted <-mdy_hms(gpstracking$Timestamp)
gpstracking$hour <- hour(gpstracking$TimeStampFormatted)
#Delete timestamp column
gpstracking <- gpstracking %>%
dplyr::select(-Timestamp)
#Reorder columns
col_order <- c("TimeStampFormatted", "id","lat","long","hour")
gpstracking <- gpstracking[, col_order]
#For loyaltycard dataset
loyaltycard$TimeStampFormatted <- mdy(loyaltycard$timestamp)
#Delete timestamp column
loyaltycard <- loyaltycard %>%
dplyr::select(-timestamp)
#Reorder columns
col_order <- c("TimeStampFormatted", "location","price","loyaltynum")
loyaltycard <- loyaltycard[, col_order]
We also observe special unidentifiable characters in Katerina’s Cafe in the creditcard and loyaltycard dataset. Those shall be identified and replaced using the str_replace_all() function to prevent error in data processing.
#creditcard
creditcard <- creditcard %>%
mutate(location = str_replace_all(location,pattern = "Katerin.+",replacement = "Katerinas Cafe"))%>% mutate(location = str_replace_all(location,pattern = "[^[:alnum:]]",replacement = " "))
#loyaltycard
loyaltycard <- loyaltycard %>%
mutate(location = str_replace_all(location,pattern = "Katerin.+",replacement = "Katerinas Cafe"))%>% mutate(location = str_replace_all(location,pattern = "[^[:alnum:]]",replacement = " "))
Next, we include into the gpstracking dataset, the first and last names of the personnel the car is assigned to. We essentially want to do a left join for the gpstracking dataset, and the car_assigned dataset, by the car ID. We can do this with the left_join() function, so that we keep all rows in the gpstracking dataset.
gpsname <- left_join(gpstracking, car_assigned_only, by = c("id" = "CarID"))
#Join First Name and Last Name
gpsname$name <- paste(gpsname$FirstName, gpsname$LastName)
#Reorder cols
col_order <- c("TimeStampFormatted", "id","lat","long","hour","name","LastName","FirstName","CurrentEmploymentType","CurrentEmploymentTitle")
gpsname <- gpsname[, col_order]
Next, we shall prepare the geospatial map for viewing. We would use the Raster package to import the raster file for the map of Abila.
The file to be imported is already georeferenced using qGIS, into .tif format. Note that the raster layer is a three bands false colour image, we would use tm_rgb() instead of tm_raster() to be able to display all three bands. If not, the layer would come out in monochrome.
#Importing the Raster file
bgmap <- raster("data/MC2/MC2-tourist_modified.tif")
#Plotting the Raster Layer and defining as base layer.
tmap_mode("view")
tmain <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255)
Then, we shall map the aspatial data next. Essentially, the following general steps are required to be able to create layers to visualize on the map:
Select the latitude and longitude coordinates that we want displayed. These coordinates are in the .dbl format.
Convert it to Simple Feature Data Frame via the st_as_sf() function of the sf package; Coordinates would be converted to geometry format. They would be input as longitude (‘long’ ; x-coordinates) and latitude (‘lat’ ; y-coordinates), in the EPSG: 4326 format, which is the wgs84 Geographic Coordinate System.
Example:
gps_sf <- st_as_sf(gpstracking,
coords = c("long", "lat"),
crs = 4326)
We can either use them as individual geometry points, or string them up to form a path
To form a path, we use the st_cast(“LINESTRING”) function
Example:
(Here, we are creating the movement path from GPS points for each car. Hence, we need to group the data by the car ID, the identifier. As R requires a command following the group_by() function, an input will be required for the code to run, so we include a dummy summarize() code to overcome this issue.)
gps_path <- gps_sf %>%
group_by(id) %>%
summarize(m = mean(TimeStampFormatted),
do_union = FALSE) %>%
st_cast("LINESTRING")
We would like to find out where the vehicles have gone to. First, we have to identify the coordinate points where the vehicles have possibly made stops, and be able to visualize where these stops are on the map. We could decipher this information from the gpstracking data, which tracks the coordinate points of vehicles as long as they are moving. This means that when there is a long gap in the timestamp at a particular coordinate point, it is likely that the vehicle has parked. Stops at traffic light junctions should take no longer than 2 minutes. As such, we shall assume for vehicles which are stationary for more than 4 minutes to be parked (i.e. search for timestamps with lag of more than 4 minutes).
In addition, we note that for position coordinates, the number of decimal places required for a particular accuracy at the equator is:
Considering the sizes of typical carparks, the accuracy of 1.11m would be too precise. A more likely range could be 11.1m. With this, we shall round all our coordinate points up to 4 decimal points for analysis.
A map with all stops identified from the entire gpstracking dataset is as shown.
#Round lat long points to 4 decimal points
gpstracking$lat <- round(gpstracking$lat, 4)
gpstracking$long <- round(gpstracking$long, 4)
#Identifying all stops for all Car IDs and Time
tmap_mode("view")
all_stops <- gpstracking %>%
group_by(id) %>%
mutate(stop = TimeStampFormatted - lag(TimeStampFormatted)) %>%
mutate(parked = ifelse(stop >60*4, TRUE,FALSE)) %>%
ungroup() %>%
filter(parked == TRUE) %>%
distinct(lat, long)
#Converting it to sf
all_stops_sf <- st_as_sf(all_stops,
coords = c("long", "lat"),
crs = 4326) %>%
mutate(coordinates = geometry)
#Viewing it on the map
tm_all_stops <- tmain +
tm_shape(all_stops_sf) +
tm_dots(size = 0.1,
alpha = 0.3,
col = "red")
tm_all_stops
We can observe that there is clustering effect of points, indicating that our rounding of the coordinate points is sufficient to remove variation noise, yet maintain the distinction between different spatial points, as there continues to be points scattered around the map.
With all of the above codes, the following function print_routes_ID_date(emp_id, start_dt,end_dt) is created to generate the routes of a specific vehicle, during a specific time period. This way, we are able to print the routes for a specific vehicle ID, for a selected time period from start_dt to end_dt.
print_routes_ID_date <- function(emp_id,start_dt,end_dt){
#filter gps_path by ID and datetime
#Start with gpstracking, filter
id_time_select <- gpstracking %>% as_tbl_time(index=TimeStampFormatted) %>%
filter(id == emp_id) %>%
filter_time(start_dt ~ end_dt)
#Convert to selected sf
gps_sf_selected <- st_as_sf(id_time_select,
coords = c("long", "lat"),
crs = 4326) %>%
mutate(coordinates = geometry)
#Create a LineString
gps_path_selected <- gps_sf_selected %>%
group_by(id) %>%
summarize(m = mean(TimeStampFormatted),
do_union = FALSE) %>%
st_cast("LINESTRING")
#filter stop points by ID and datetime
stops <- gpstracking %>% as_tbl_time(index=TimeStampFormatted) %>%
group_by(id) %>%
filter(id == emp_id) %>%
filter_time(start_dt ~ end_dt) %>%
mutate(stop = TimeStampFormatted - lag(TimeStampFormatted)) %>%
mutate(parked = ifelse(stop >60*4, TRUE,FALSE)) %>%
ungroup() %>%
filter(parked == TRUE) %>%
distinct(lat, long)
#Converting stop points to sf
stops_sf <- st_as_sf(stops,
coords = c("long", "lat"),
crs = 4326)
#Viewing it on the map
mapviz <- tm_all_stops +
tm_shape(gps_path_selected) +
tm_lines(lwd = 3) +
tm_shape(stops_sf) +
tm_dots(size = 0.1,
alpha = 0.3,
col = "green")
return(mapviz)
}
Example of printing the route for car ID #31, on a specific date time period - 18 Jan.
print_routes_ID_date(31,'2014-01-18 00:00:00','2014-01-18 23:59:00')
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
From the bar plot above, by credit card transactions, Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served are the most popular locations with the highest number of transactions over two weeks. We shall see the outcome when the loyalty card transactions are analyzed in the same manner. From the bar plot below, the same results are derived for the Top 4 most popular locations.
However, the bar plots only tell us the total number of visits over the 2 weeks. We do not know which days have higher visits, or the profile of the visits over the day. To view this, we can use violin plots. The thickness of the plots will also reflect the frequency of the visit at that point in time. We will also be able to see which days did the visits take place.
We shall use the violin plots in the plotly package as they are interactive, we can hover over the plots, especially where they are thicker to see what time periods are those. There are also boxplots to tell us what the distribution and spread is like.
From the violin plot above, we can see that the top 3 frequented locations identified above (Katerina’s Cafe, Hippokampos, Guy’s Gyros) have a pretty consistent visit rate daily throughout the 2 weeks from 6th to 20th Jan. The thickness of the violin plot is consistent. This means that they are popular throughout the 2 weeks. The 4th frequented location Brew’ve been Served has two clear bulges, indicating there is a lull period between the busy periods.
However, it is observed that other locations can have higher visiting rates at specific periods than the identified Top 3. For example, the visiting rates for Desaflo Golf Course was much higher specifically on 19th Jan, same for Kronos Pipe and Irrigation on 7th Jan. Hallowed Grounds also had peak visiting rates on 8th and 15th Jan.
The violin plot showed a good overview that the visiting days vary across the locations. Now we shall dissect this further into day and time.
We shall use the geom_tile plot from ggplot to visualize this, wrapped with ggplotly() to provide us with interactivity as we would be able to hover over each square and see the corresponding location and date/time. One tile plot would be for “Visits at Each Location for Each Day” and the other for “Visits at Each Location for Time of Day”.
We would first prepare the dataset for this plot by extracting the date and time data into different columns, then do a count for them respectively. As the timestamp has been formatted into POSIXct format earlier, R already stores it in dttm (date-time) format. We can just use the as_date and as_hms functions from the lubridate package to extract them out. We should retain them in date and time format for easier manipulation later.
creditcard$date <- as_date(creditcard$TimeStampFormatted)
creditcard$time <- hms::as_hms(creditcard$TimeStampFormatted)
Next, we would count the number of visits to each location for each Day and time period. While the Date variable is more discrete and easier to count, more work needs to be done to count for the Time variable as it would not make sense to count how many visits are there when the Time value is detailed to the second. Hence, we will round the timestamp down to the nearest half hour for this count, via the round_date() function.
#Count by Day
cardbyday <- creditcard %>%
group_by(date) %>%
add_count(date, location) %>%
distinct(date, location, n)
loyalcardbyday <- loyaltycard %>%
group_by(TimeStampFormatted) %>%
add_count(TimeStampFormatted, location) %>%
distinct(TimeStampFormatted, location, n)
#Count by Time
#Round the time to the nearest half hour
cardbytime <- creditcard %>%
group_by(thirtymins = hms::as_hms(round_date(TimeStampFormatted, "30 mins"))) %>%
add_count(thirtymins, location) %>%
distinct(thirtymins, location, n)
With the data prepared, we shall do the tile plot via geom_tile() from ggplot, then wrapped by ggplotly() for interactivity.
This would be for the “Visits at each Location for Each Day (Credit Card)”.
This is the same plot plotted via loyalty card data.
Next, this would be for “Visits at Each Location for Time of Day”, which can only come from the credit card transactions dataset as transaction time information is not present in loyalty card dataset.
From the plots above, we could see that Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served have the lightest bars across the 2 weeks. This shows that they are consistently popular. We also note that Guy’s Gyros is more popular during weekdays than weekends, and Brew’ve Been Served only has visits on weekdays, indicating that it is probably not open on weekends.
| No. | Location | Periods Popular | Remarks |
|---|---|---|---|
| 1 | Katerina’s Cafe | All Days ; Lunchtime (1300 hrs - 1430 hrs), Dinnertime (1900 hrs - 2130 hrs) | NIL |
| 2 | Hippokampos | All Days ; Lunchtime (1230 hrs - 1430 hrs), Dinnertime (1930 hrs - 2230 hrs) | NIL |
| 3 | Guy’s Gyros | More Popular During Weekdays ; Lunchtime (1300 hrs - 1430 hrs), Dinnertime (1900 hrs - 2130 hrs) | Unusually crowded on 19 Jan (Sun) compared to the previous Sunday and weekend crowd |
| 4 | Brew’ve Been Served | Weekday Mornings ; Morning Coffee (0730 hrs - 0830 hrs) | Huge Crowd at 0800 hrs ; Seems closed on weekends |
We note that the above popular locations are all identified on the map of Abila, and are located near Gastech, all except for Hippokampos, which is nowhere to be found on the map. This is an anomaly. To be able to solve the mystery of where Hippokampos is, we would have to delve into other data like vehicle tracking. We would need to match transaction timings with gps tracking data, the coordinates matching at that time would tell us the position of Hippokampos.
U Pump is a refuel station, yet it is only visited twice from the credit card transactions, on two separate days - 6th and 13th Jan. The frequency of refuel is unusually infrequent. It could be due to employees not needing to pay for the fuel using their own credit cards. But strange thing is the loyalty card is also not used either. The same goes for Frank’s Fuel, there were only transactions on 2 days - 8th and 18th Jan. This is just an abnormal situation, which we hope to seek further insights later. Or it could be that there were visits, but purchases were neither paid by credit card nor loyalty card was applied. This would be suspicious, as though the purchasers wanted to keep their tracks secret. We can similary look at gps tracking data to see if there were stops at the fuel stations.
Supermarkets should have daily transactions. However, according to credit card transaction data, Kronos Mart and General Grocer can have 3 - 5 days without any transactions within the 2 weeks. Maximum number of transactions only goes up to 3 for Kronos Mart, and 2 for General Grocer for one day each. Business seems awfully bad.
When we compare this with loyalty card transaction data, there are still days without any transactions, but we see the following differences:
| Date | Credit Card Data | Loyalty Card Data |
|---|---|---|
| 9 Jan | No transaction | 1 transaction |
| 10 Jan | 1 transaction | No transaction |
| 12 Jan | 1 transaction | 2 transactions |
| 13 Jan | 2 transactions | No transaction |
| 14 Jan | 1 transaction | No transaction |
| 15 Jan | No transaction | 1 transaction |
| 17 Jan | 1 transaction | No transaction |
| 18 Jan | No transaction | 3 transactions |
| 19 Jan | 3 transactions | No transaction |
This means that there are transactions paid by cash, not paid by credit card. There were also transactions made without using loyalty card. But business still seems abnormally bad. We would use gpstracking data to see if there were actually any visits to the supermarkets, just that transactions were paid in cash.
Here, we see that the transaction with the highest amount of $10,000 is made at Frydos Autosupply n More. It is also very different from its normal range, an extreme outlier. We shall note this point.
Extreme outliers are also detected at
We shall also note the highest value purchases which are greater than $4000, with the exception of the highest transaction at Stewart and Sons Fabrication, as that is near to its normal range.
Details of these transactions are tabulated as follows:
We note that owner of credit card 4530 (and loyalty card L8477) and 2276 (and loyalty card L3317) own two of these high transaction amounts each. It is also strange that the purchaser behind the huge transaction of $4918.39 at Abila Airport did not use his/her loyalty card to get further discounts on such a big ticket purchase.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
We need to be able to find some points which distinguishes Hippokampos from the other locations. I noted from the Visits at each Location for Time of Day plot that only Hippokampos had a transaction happening at around 2230 hrs.
Hence, we shall filter the gps data by this time period, to see where the possible stops are at.
The green dots show all stops made during the time period from 2200 hrs to 2300 hrs. All these are possible locations of Hippokampos. I note that some are linked to known and labeled locations such as Kronos Mart, some are near places of residences. The only two points which do not belong in these categories are as marked below - along Arkadiou Street, and at GasTech. The stop along Arkadiou Street is also deduced to not be likely, because there were no other stop points surrounding it. If it is a parking place for Hippokampos, there should be way more aggregate points. Therefore, it is drilled down that Hippokampos is likely to be located in GasTech, probably as a canteen of sorts. We can also observe this from the transactions at time of day plot, the peak transaction time is around 1330 hrs - lunchtime, and transactional periods are distinctly from 1230 - 1430 hrs, 1930 - 2230 hrs (lunch and dinnertime). No transactions outside of this period. Hippokampos is located within GasTech.

Credit card and loyalty card transactions show that there are only a handful of transactions each day, or sometimes no business at all, making it seem like the stores and business places are empty.
After visualizing the routes of individuals, we notice that people make stops at the place very regularly but the transaction numbers do not reflect that.
For example, Elsa Orilla would pop by Coffee Shack on her way to GasTech every week day, and stay there for at least half hour each time. She visited on 10 Jan as well, but there were no transactions made via credit card there on that day.

This shows that looking at transaction data via credit card/loyalty card alone is insufficient to show whether a location is popular or not. There could be transactions made by cash as well, though we are not sure why would the people not make use of discounts from loyalty cards. This would apply to the unusually low number of transactions at refuel stations and supermarkets as well, resolving the anomaly in Q1.
Next, we will use the function below to do a plot of travel time periods on each day, for each personnel. This is to aid further investigation. A sample of the plots for 6 personnel is as shown.